fix: use buffered and script based get-reads implementation #126

johanneskoester · 2025-05-20T07:59:22Z

Summary by CodeRabbit

New Features
- Added a new script to extract paired-end reads from BAM files and output them as FASTQ files.
Improvements
- Updated environment dependencies to include an additional required package.
- Enhanced the read extraction process for improved accuracy and maintainability.
- Updated configuration comments to reflect a higher optional read limit.
Refactor
- Replaced previous shell-based extraction with a dedicated Python script.
- Simplified configuration and resource handling for the read extraction rule.
- Updated tool wrapper versions for improved compatibility.
- Removed deprecated read limit function to streamline processing.

coderabbitai · 2025-05-20T07:59:28Z

"""

Walkthrough

The environment configuration was updated to add the dnaio dependency. The get_read_limit_param function was removed from the workflow. The get_reads rule was refactored to use a new Python script, get-reads.py, for extracting paired-end reads from BAM files, replacing the previous shell-based approach and updating the conda environment. The samtools_faidx and index_bcf rules were updated to use newer wrapper versions. The configuration comment for limit-reads was updated to reflect a higher read limit.

Changes

File(s)	Change Summary
workflow/envs/pysam.yaml	Added `dnaio=1.2` to the environment dependencies.
workflow/rules/common.smk	Removed the `get_read_limit_param` function.
workflow/rules/download.smk	Refactored the `get_reads` rule to use `get-reads.py` script, changed conda env to `pysam.yaml`, simplified parameters, removed explicit thread resources, and updated `samtools_faidx` wrapper version.
workflow/scripts/get-reads.py	Introduced a new script to extract paired-end reads from a BAM file and write them to FASTQ using `pysam` and `dnaio`.
config/config.yaml	Updated comment for `limit-reads` to indicate a 100,000 read limit instead of 10,000; no functional change to the value.
workflow/rules/utils.smk	Updated `index_bcf` rule to use a newer wrapper version (`v6.2.0` instead of `v1.9.0`).

Sequence Diagram(s)

sequenceDiagram
    participant Snakemake
    participant get-reads.py
    participant pysam
    participant dnaio

    Snakemake->>get-reads.py: Run script with BAM input and FASTQ outputs
    get-reads.py->>pysam: Open BAM file for reading
    loop For each alignment
        get-reads.py->>get-reads.py: Buffer reads by query name
        get-reads.py->>get-reads.py: Pair mates when both found
        get-reads.py->>dnaio: Write paired reads to FASTQ
    end
    get-reads.py->>Snakemake: Log progress and completion

Poem

In the warren, scripts now hop anew,
With dnaio and pysam, the reads flow through.
No more shell commands, just Pythonic delight,
Paired-end friends find their FASTQ flight.
With dependencies set and old code shed,
The workflow leaps ahead!
🐇✨
"""

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 2e86719 and 7b2cbcc.

📒 Files selected for processing (1)

workflow/rules/utils.smk (1 hunks)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Testing

🔇 Additional comments (1)

workflow/rules/utils.smk (1)

20-20: Appropriate wrapper version update

Updating the bcftools index wrapper from v1.9.0 to v6.2.0 aligns with the other wrapper standardization changes in this PR. This significant version jump should provide better compatibility with newer tools and is consistent with the samtools_faidx wrapper update mentioned in the summary.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Explain this complex logic.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai explain this code block.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and explain its main purpose.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR.
@coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (3)

workflow/envs/pysam.yaml (1)
9-9: Add a newline at the end of the file.

The file is missing a newline character at the end, which is considered a best practice and required by some tools.
  - dnaio =1.2
+ - dnaio =1.2
+
🧰 Tools

🪛 YAMLlint (1.37.1)

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)
workflow/scripts/get-reads.py (2)
3-3: Use a context manager for opening the log file.

Using a context manager ensures the file is properly closed even if exceptions occur.
-sys.stderr = open(snakemake.log[0], "w")
+with open(snakemake.log[0], "w") as log_file:
+    sys.stderr = log_file
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

35-37: Make the read1/read2 ordering more robust.

The current code only checks if the alignment is read2, but doesn't verify the other alignment is read1. This could lead to incorrect pairing in some edge cases.
-            if aln.is_read2:
+            if aln.is_read2 and mate_aln.is_read1:
                 aln, mate_aln = mate_aln, aln
+            elif aln.is_read1 and mate_aln.is_read2:
+                # Already in correct order
+                pass
+            else:
+                # Handle unexpected read flags
+                continue

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6dc4dca and b415c7c.

📒 Files selected for processing (4)

workflow/envs/pysam.yaml (1 hunks)
workflow/rules/common.smk (0 hunks)
workflow/rules/download.smk (1 hunks)
workflow/scripts/get-reads.py (1 hunks)

💤 Files with no reviewable changes (1)

workflow/rules/common.smk

🧰 Additional context used

🪛 YAMLlint (1.37.1)

workflow/envs/pysam.yaml

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)

🪛 Ruff (0.11.9)

workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

🔇 Additional comments (3)

workflow/envs/pysam.yaml (1)

9-9: Addition of dnaio looks good!

The dnaio package is correctly added to support the new Python script-based approach for handling FASTQ files.

🧰 Tools

🪛 YAMLlint (1.37.1)

[error] 9-9: no new line character at the end of file

(new-line-at-end-of-file)

workflow/rules/download.smk (2)

6-6: Parameter handling looks good.

The branch function effectively replaces the previous get_read_limit_param approach, providing a cleaner way to conditionally set the read limit.

11-14: Good refactoring to use a Python script instead of shell commands.

Switching from a shell-based approach to a dedicated Python script improves maintainability and reliability for handling paired reads. The conda environment has been appropriately updated to include the required dependencies.

workflow/scripts/get-reads.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

workflow/rules/download.smk

coderabbitai

Actionable comments posted: 3

🧹 Nitpick comments (4)

workflow/scripts/get-reads.py (4)
1-7: Move imports to the top of the file.

Following Python best practices (PEP 8), all imports should be placed at the top of the file before any other code. The current placement interrupts the import section with stderr redirection.
 import sys
+import pysam
+import dnaio
 
 sys.stderr = open(snakemake.log[0], "w")
-
-import pysam
-import dnaio
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

3-3: Use a context manager for the log file.

The current approach of directly reassigning sys.stderr doesn't ensure the file is properly closed when the script ends. Using a context manager or at least closing the file at the end of the script is recommended.
-sys.stderr = open(snakemake.log[0], "w")
+# Option 1: Redirect within a try-finally to ensure closing
+log_file = open(snakemake.log[0], "w")
+sys.stderr = log_file
+try:
+    # Rest of your code would be indented here
+finally:
+    log_file.close()
+
+# Option 2: Use a custom context manager for stderr redirection
+class RedirectStderr:
+    def __init__(self, file_path):
+        self.file_path = file_path
+        self.original_stderr = None
+        self.file_obj = None
+    
+    def __enter__(self):
+        self.file_obj = open(self.file_path, "w")
+        self.original_stderr = sys.stderr
+        sys.stderr = self.file_obj
+        return self.file_obj
+    
+    def __exit__(self, exc_type, exc_val, exc_tb):
+        sys.stderr = self.original_stderr
+        if self.file_obj:
+            self.file_obj.close()
+
+with RedirectStderr(snakemake.log[0]):
+    # Rest of your code would be here
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

38-40: Add orphaned read handling at end of file processing.

Currently, any reads in the buffer without mates at the end of processing are silently ignored. Consider handling these orphaned reads by either logging them or optionally writing them as single-end reads.
            fqwriter.write(aln_to_fq(aln), aln_to_fq(mate_aln))
            n_written += 1
+
+# Add this at the end, after the loop completes
+if buffer:
+    print(f"Warning: {len(buffer)} unpaired reads found in BAM file", file=sys.stderr)
+    # Optionally, write unpaired reads as single-end if needed
+    # for orphaned_aln in buffer.values():
+    #     fqwriter.write(aln_to_fq(orphaned_aln), None)
40-40: Explicitly close the BAM file.

The BAM file should be explicitly closed after processing to ensure proper resource cleanup, especially for large files.
            fqwriter.write(aln_to_fq(aln), aln_to_fq(mate_aln))
            n_written += 1
+
+# Close the BAM file after processing
+bam.close()

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between b415c7c and 27bad31.

📒 Files selected for processing (2)

workflow/rules/download.smk (1 hunks)
workflow/scripts/get-reads.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

workflow/rules/download.smk

🧰 Additional context used

🪛 Ruff (0.11.9)

workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

🔇 Additional comments (2)

workflow/scripts/get-reads.py (2)

24-25: LGTM!

The condition to check if limit is None before comparing has been properly implemented, addressing a previous issue.

26-27: LGTM!

The condition to skip secondary and supplementary alignments is now correct, addressing a previous issue.

workflow/scripts/get-reads.py

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (3)

workflow/scripts/get-reads.py (3)
9-14: Handle potential None quality scores in aln_to_fq function.

The function doesn't handle the case where aln.query_qualities could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.
 def aln_to_fq(aln):
+    # Handle case when query_qualities is None
+    qualities = aln.query_qualities
+    if qualities is None:
+        qualities = [0] * len(aln.get_forward_sequence())
+    
     return dnaio.SequenceRecord(
         name=aln.query_name,
         sequence=aln.get_forward_sequence(),
-        quality=aln.query_qualities,
+        quality=qualities,
     )
17-18: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools

🪛 Ruff (0.11.9)

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

29-33: Fix inconsistent query name usage in buffer handling.

The code computes qname by removing suffixes but then doesn't use it for the buffer lookup or storage, which could lead to missing mate pairs if the suffixes are inconsistently present.
-        qname = aln.query_name.removesuffix("/1").removesuffix("/2")
+        # Remove potential read suffixes for consistent mate pairing
+        qname = aln.query_name.removesuffix("/1").removesuffix("/2")
 
-        mate_aln = buffer.get(aln.query_name)
+        mate_aln = buffer.get(qname)
         if mate_aln is None:
-            buffer[aln.query_name] = aln
+            buffer[qname] = aln

🧹 Nitpick comments (3)

workflow/scripts/get-reads.py (3)
3-3: Use a context manager for file operations.

This line opens a file for stderr redirection but doesn't handle closing it properly. Using a context manager ensures the file is properly closed even if an exception occurs.
-sys.stderr = open(snakemake.log[0], "w")
+# Use context manager or explicit close to prevent resource leaks
+sys.stderr = open(snakemake.log[0], "w")
Alternatively, consider using a context manager:
import contextlib
with open(snakemake.log[0], "w") as log_file:
    with contextlib.redirect_stderr(log_file):
        # Rest of your code
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

18-18: Close BAM file explicitly after processing.

The script doesn't explicitly close the BAM file after processing, which could lead to resource leaks. Consider using a context manager or explicitly closing the file.
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+# Use context manager to ensure proper resource cleanup
+with pysam.AlignmentFile(snakemake.params.bam_url) as bam:
+    # Move the processing code inside this context
🧰 Tools

🪛 Ruff (0.11.9)

18-18: Undefined name snakemake

(F821)

1-40: Add docstring and include error handling for the script.

The script lacks a docstring explaining its purpose and usage. Additionally, it could benefit from more comprehensive error handling, especially for the main processing loop.
 import sys
+"""
+Extract paired-end reads from a BAM file and write them to FASTQ files.
+
+This script uses pysam to read alignments from a BAM file and dnaio to write
+paired FASTQ output files. It buffers alignments by query name to pair mates,
+skipping secondary and supplementary alignments.
+
+Input:
+    - BAM file URL (from snakemake.params.bam_url)
+    - Optional read limit (from snakemake.params.limit)
+
+Output:
+    - Paired FASTQ files (from snakemake.output[0] and snakemake.output[1])
+    - Log file (from snakemake.log[0])
+"""
Additionally, consider adding more robust error handling throughout:
try:
    with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
        # Main processing code
except Exception as e:
    print(f"Error processing BAM file or writing FASTQ: {e}", file=sys.stderr)
    sys.exit(1)
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 27bad31 and 7f7a4c6.

📒 Files selected for processing (1)

workflow/scripts/get-reads.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

workflow/scripts/get-reads.py (2)
9-14: Handle potential None quality scores in aln_to_fq function.

The function doesn't handle the case where aln.query_qualities could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.
 def aln_to_fq(qname, aln):
+    # Handle case when query_qualities is None
+    qualities = aln.query_qualities
+    if qualities is None:
+        qualities = [0] * len(aln.get_forward_sequence())
+    
     return dnaio.SequenceRecord(
         name=qname,
         sequence=aln.get_forward_sequence(),
-        qualities=aln.query_qualities,
+        qualities=qualities,
     )
18-18: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools

🪛 Ruff (0.11.9)

18-18: Undefined name snakemake

(F821)

🧹 Nitpick comments (1)

workflow/scripts/get-reads.py (1)
3-3: Use a context manager for file handling.

The file is opened without using a context manager, which could lead to resource leaks if the script exits unexpectedly.
-sys.stderr = open(snakemake.log[0], "w")
+with open(snakemake.log[0], "w") as log_file:
+    sys.stderr = log_file
+    
+    # Main script logic should be indented here
+    # Or use a separate function that's called within this context
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 7f7a4c6 and e063c93.

📒 Files selected for processing (1)

workflow/scripts/get-reads.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

17-17: Undefined name snakemake

(F821)

18-18: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

22-22: Undefined name snakemake

(F821)

🔇 Additional comments (2)

workflow/scripts/get-reads.py (2)

24-27: Fixed issues from previous review.

The code now correctly:

Checks if limit is None before comparing with n_written

Uses the correct condition to skip secondary and supplementary alignments

Great job addressing these issues!

29-39: Correct implementation of read pair matching.

The code now properly:

Removes suffixes from query names for consistent matching

Uses the normalized query name for buffer lookup and storage

Ensures correct ordering of read pairs before writing

This implementation effectively handles paired-end reads from BAM files.

coderabbitai

Actionable comments posted: 1

🔭 Outside diff range comments (1)

workflow/scripts/get-reads.py (1)

22-42: 🛠️ Refactor suggestion

Implement buffer cleanup mechanism.

The current implementation stores unpaired reads in the buffer but never clears them, which could lead to memory issues with large BAM files containing many unpaired reads.

Consider implementing one of these approaches:

Periodically check and remove old entries from the buffer
Set a maximum buffer size and handle overflow
After processing all alignments, report any unpaired reads left in the buffer

 buffer = {}
 n_written = 0
+buffer_limit = 1000000  # Set a reasonable buffer size limit
+
 with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
     for aln in bam:
         if limit is not None and n_written >= limit:
             break
         if aln.is_secondary or aln.is_supplementary:
             continue
 
         qname = aln.query_name.removesuffix("/1").removesuffix("/2")
 
         mate_aln = buffer.get(qname)
         if mate_aln is None:
             buffer[qname] = aln
+            # Check if buffer size exceeds limit
+            if len(buffer) > buffer_limit:
+                print(f"Warning: Buffer size exceeds {buffer_limit}. Some reads may be unpaired.", 
+                      file=sys.stderr)
         else:
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
 
             fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln))
             n_written += 1
+            # Remove entry from buffer after successful write
+            del buffer[qname]
+
+# Report statistics on unpaired reads
+if buffer:
+    print(f"Warning: {len(buffer)} reads remained unpaired after processing.", file=sys.stderr)

🧰 Tools

🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

♻️ Duplicate comments (2)

workflow/scripts/get-reads.py (2)
9-16: Handle potential None quality scores in aln_to_fq function.

The function doesn't handle the case where aln.get_forward_qualities() could be None, which might happen with some BAM files. This could lead to unexpected errors when creating the SequenceRecord.
 def aln_to_fq(qname, aln):
+    # Handle case when get_forward_qualities returns None
+    qualities = aln.get_forward_qualities()
+    if qualities is None:
+        qualities = [0] * len(aln.get_forward_sequence())
+        quality_str = "!" * len(aln.get_forward_sequence())  # '!' is the lowest quality character
+    else:
+        quality_str = "".join(map(lambda qual: chr(qual + 33), qualities))
+    
     return dnaio.SequenceRecord(
         name=qname,
         sequence=aln.get_forward_sequence(),
-        qualities="".join(
-            map(lambda qual: chr(qual + 33), aln.get_forward_qualities())
-        ),
+        qualities=quality_str,
     )
19-21: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file URL. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
+import os
+
 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools

🪛 Ruff (0.11.9)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

🧹 Nitpick comments (2)

workflow/scripts/get-reads.py (2)

23-27: Consider adding progress reporting.

For large BAM files, processing can take a significant amount of time. Adding progress reporting would help monitor script execution.

 buffer = {}
 n_written = 0
+report_interval = 100000  # Report every 100k read pairs
+processed = 0
+
 with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
     for aln in bam:
+        processed += 1
+        if processed % report_interval == 0:
+            print(f"Processed {processed} alignments, written {n_written} pairs", file=sys.stderr)
+
         if limit is not None and n_written >= limit:
+            print(f"Reached limit of {limit} read pairs. Stopping.", file=sys.stderr)
             break

🧰 Tools

🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

37-41: Validate read pairing before writing.

The code assumes that if one read is marked as read2, then the other must be read1, but it doesn't explicitly verify this. Adding validation would prevent potential errors.

         else:
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
 
+            # Validate proper read pairing
+            if (aln.is_read1 and mate_aln.is_read2) or (aln.is_read2 and mate_aln.is_read1):
+                # Reads are properly paired
+                pass
+            else:
+                print(f"Warning: Improper read pairing for {qname}. Flags: {aln.flag}, {mate_aln.flag}", 
+                      file=sys.stderr)
+            
             fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln))
             n_written += 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between e063c93 and d0820d3.

📒 Files selected for processing (1)

workflow/scripts/get-reads.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

workflow/scripts/get-reads.py

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (2)

workflow/scripts/get-reads.py (2)
1-7: Improve file handling with context managers and error handling

The current implementation directly assigns to sys.stderr without proper error handling or cleanup mechanism.
 import sys
+import os
 
-sys.stderr = open(snakemake.log[0], "w")
+# Redirect stderr to log file using context manager
+try:
+    sys.stderr = open(snakemake.log[0], "w", buffering=1)
+except (IOError, OSError) as e:
+    print(f"Error opening log file: {e}", file=sys.stderr)
+    sys.exit(1)
 
 import pysam
 import dnaio
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

19-21: Add error handling for BAM file operations

The current code lacks error handling when opening and accessing the BAM file, which could lead to uncaught exceptions.
 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools

🪛 Ruff (0.11.9)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

🧹 Nitpick comments (3)

workflow/scripts/get-reads.py (3)
24-29: Add progress reporting for long-running operations

For large BAM files, there's no indication of progress, which can make it difficult to monitor the script's execution.
 n_written = 0
+processed = 0
+report_interval = 100000  # Report progress every 100k reads
 with dnaio.open(snakemake.output[0], snakemake.output[1], mode="w") as fqwriter:
     for aln in bam:
+        processed += 1
+        if processed % report_interval == 0:
+            print(f"Processed {processed} alignments, written {n_written} pairs", file=sys.stderr)
         if limit is not None and n_written >= limit:
+            print(f"Reached limit of {limit} pairs, stopping", file=sys.stderr)
             break
         if aln.is_secondary or aln.is_supplementary:
             continue
🧰 Tools

🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

38-44: Validate read pairs before writing

The current implementation assumes that reads with the same name belong to the same fragment, but additional validation could ensure correct pairing.
         else:
+            # Ensure reads are properly paired
+            if not (aln.is_read1 and mate_aln.is_read2) and not (aln.is_read2 and mate_aln.is_read1):
+                print(f"Warning: Reads with name {qname} don't appear to be proper pairs based on flags", file=sys.stderr)
+            
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
             del buffer[qname]

             fqwriter.write(aln_to_fq(qname, aln), aln_to_fq(qname, mate_aln))
             n_written += 1
1-3: Import snakemake explicitly for clarity

The script uses the snakemake variable without importing it, which works in Snakemake's execution environment but reduces code clarity.
 import sys
+# The snakemake variable is implicitly available when executed through Snakemake
+# Adding this comment improves clarity for developers unfamiliar with Snakemake
 
 sys.stderr = open(snakemake.log[0], "w")
🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between d0820d3 and 0f7538f.

📒 Files selected for processing (1)

workflow/scripts/get-reads.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

workflow/scripts/get-reads.py

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (2)

workflow/scripts/get-reads.py (2)
9-16: 🛠️ Refactor suggestion

Handle potential None qualities in the aln_to_fq function.

The function doesn't handle the case where aln.get_forward_qualities() could be None or empty, which can happen with some BAM files.
 def aln_to_fq(qname, aln):
+    # Handle case when forward qualities might be None
+    qualities = aln.get_forward_qualities()
+    if qualities is None:
+        # Default to quality score of 30 (ASCII '?') for all bases
+        quality_str = "?" * len(aln.get_forward_sequence())
+    else:
+        quality_str = "".join(map(lambda qual: chr(qual + 33), qualities))
+    
     return dnaio.SequenceRecord(
         name=qname,
         sequence=aln.get_forward_sequence(),
-        qualities="".join(
-            map(lambda qual: chr(qual + 33), aln.get_forward_qualities())
-        ),
+        qualities=quality_str,
     )
1-7: 🛠️ Refactor suggestion

Use a context manager for the log file.

The current code directly assigns to sys.stderr without ensuring proper file closure. Also, the script uses snakemake which is implicitly provided in the Snakemake environment but not imported explicitly.
 import sys
 
-sys.stderr = open(snakemake.log[0], "w")
+# Use context manager for log file with line buffering
+log_file = open(snakemake.log[0], "w", buffering=1)
+sys.stderr = log_file
 
 import pysam
 import dnaio
The code should also close the log file at the end of script execution or restore the original stderr.

🧰 Tools

🪛 Ruff (0.11.9)

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

🧹 Nitpick comments (2)

workflow/scripts/get-reads.py (2)
19-20: Add error handling for BAM file opening.

The current code doesn't handle potential errors when opening the BAM file. This could lead to uncaught exceptions if the URL is invalid or the file is inaccessible.
 limit = snakemake.params.limit
-bam = pysam.AlignmentFile(snakemake.params.bam_url)
+try:
+    bam = pysam.AlignmentFile(snakemake.params.bam_url)
+except (IOError, ValueError) as e:
+    print(f"Error opening BAM file: {e}", file=sys.stderr)
+    sys.exit(1)
🧰 Tools

🪛 Ruff (0.11.9)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

24-44: The read pairing logic looks good but consider adding more validation.

The read pairing logic is sound and handles the use case correctly. However, consider adding validation to ensure that paired reads actually belong together (e.g., checking read flags or positions).
         else:
+            # Validate that reads are properly paired
+            if not (aln.is_paired and mate_aln.is_paired):
+                print(f"Warning: Read {qname} has inconsistent paired flags", file=sys.stderr)
+            
             if aln.is_read2:
                 aln, mate_aln = mate_aln, aln
             del buffer[qname]
🧰 Tools

🪛 Ruff (0.11.9)

24-24: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 0f7538f and 11961b9.

📒 Files selected for processing (1)

workflow/scripts/get-reads.py (1 hunks)

🧰 Additional context used

🪛 Ruff (0.11.9)

workflow/scripts/get-reads.py

3-3: Use a context manager for opening files

(SIM115)

3-3: Undefined name snakemake

(F821)

19-19: Undefined name snakemake

(F821)

20-20: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

24-24: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms (1)

GitHub Check: Formatting

🔇 Additional comments (1)

workflow/scripts/get-reads.py (1)

46-47: Good handling of orphaned reads.

The code correctly handles orphaned reads by providing a warning message. This is a good improvement from previous iterations.

…ws/dna-seq-benchmark into fix/get-reads-buffered

workflow/scripts/get-reads.py

famosab

LGTM! Should we check for updates for the wrappers in a seperate PR aswell? I noticed that you updated a few of them :)

johanneskoester · 2025-05-22T08:40:18Z

LGTM! Should we check for updates for the wrappers in a seperate PR aswell? I noticed that you updated a few of them :)

I would say that is not urgent. In particular, it would trigger reruns, which are probably a good thing before re-uploading but not while you still finalize the analysis.

fix: use buffered and script based get-reads implementation

b415c7c

coderabbitai bot reviewed May 20, 2025

View reviewed changes

workflow/scripts/get-reads.py Outdated Show resolved Hide resolved

workflow/scripts/get-reads.py Outdated Show resolved Hide resolved

johanneskoester and others added 2 commits May 20, 2025 10:09

Update workflow/scripts/get-reads.py

30b94d8

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

Update workflow/scripts/get-reads.py

38ee6f5

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

johanneskoester commented May 20, 2025

View reviewed changes

workflow/rules/download.smk Outdated Show resolved Hide resolved

Update workflow/rules/download.smk

27bad31

coderabbitai bot reviewed May 20, 2025

View reviewed changes

workflow/scripts/get-reads.py Outdated Show resolved Hide resolved

workflow/scripts/get-reads.py Show resolved Hide resolved

workflow/scripts/get-reads.py Outdated Show resolved Hide resolved

mode

7f7a4c6

coderabbitai bot reviewed May 20, 2025

View reviewed changes

fixes

e063c93

coderabbitai bot reviewed May 20, 2025

View reviewed changes

fix

d0820d3

coderabbitai bot reviewed May 20, 2025

View reviewed changes

workflow/scripts/get-reads.py Show resolved Hide resolved

remove read from buffer when writing fastq record

0f7538f

coderabbitai bot reviewed May 20, 2025

View reviewed changes

workflow/scripts/get-reads.py Show resolved Hide resolved

workflow/scripts/get-reads.py Show resolved Hide resolved

Update workflow/scripts/get-reads.py

11961b9

coderabbitai bot reviewed May 20, 2025

View reviewed changes

johanneskoester added 2 commits May 20, 2025 14:39

dbg

8c7c4a9

Merge branch 'fix/get-reads-buffered' of github.com:snakemake-workflo…

582e1fc

…ws/dna-seq-benchmark into fix/get-reads-buffered

famosab reviewed May 20, 2025

View reviewed changes

workflow/scripts/get-reads.py Show resolved Hide resolved

johanneskoester added 5 commits May 20, 2025 15:25

fix lookup default

9cdd2c9

assertion

3c3e533

fix

fbef952

fmt

2e86719

update wrapper

7b2cbcc

This was referenced May 21, 2025

feat: split get_reads in two rules #125

Closed

fix: remove pipefail statement and switch to wget #127

Closed

famosab approved these changes May 21, 2025

View reviewed changes

johanneskoester merged commit b9d4360 into main May 22, 2025
7 checks passed

johanneskoester deleted the fix/get-reads-buffered branch May 22, 2025 08:40

fix: use buffered and script based get-reads implementation #126

fix: use buffered and script based get-reads implementation #126

Uh oh!

Conversation

johanneskoester commented May 20, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented May 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

famosab left a comment

Choose a reason for hiding this comment

Uh oh!

johanneskoester commented May 22, 2025

Uh oh!

Uh oh!

Uh oh!

johanneskoester commented May 20, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented May 20, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)